{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EBM Internals - Binary classification\n",
    "\n",
    "This is part 2 of a 3 part series describing EBM internals and how to make predictions. For part 1, click [here](./ebm-internals-regression.ipynb). For part 3, click [here](./ebm-internals-multiclass.ipynb).\n",
    "\n",
    "In this part 2 we'll cover binary classification, interactions, missing values, ordinals, and the reduced discretization resolutions for interactions. Before reading this part you should be familiar with [part 1](./ebm-internals-regression.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# boilerplate\n",
    "from interpret import show\n",
    "from interpret.glassbox import ExplainableBoostingClassifier\n",
    "import numpy as np\n",
    "\n",
    "from interpret import set_visualize_provider\n",
    "from interpret.provider import InlineProvider\n",
    "set_visualize_provider(InlineProvider())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a dataset composed of an ordinal categorical, and a continuous feature\n",
    "X = [[\"low\", 8.0], [\"medium\", 7.0], [\"high\", 9.0], [None, None]]\n",
    "y = [\"apples\", \"apples\", \"oranges\", \"oranges\"]\n",
    "\n",
    "# Fit a classification EBM with 1 interaction\n",
    "# Define an ordinal feature with specified ordering\n",
    "# Limit the number of interaction bins to force a lower resolution\n",
    "# Eliminate the validation set to handle the small dataset\n",
    "ebm = ExplainableBoostingClassifier(\n",
    "    interactions=1,\n",
    "    feature_types=[[\"low\", \"medium\", \"high\"], 'continuous'], \n",
    "    max_interaction_bins=4,\n",
    "    validation_size=0, outer_bags=1, max_rounds=900, min_samples_leaf=1, min_hessian=1e-9)\n",
    "ebm.fit(X, y)\n",
    "show(ebm.explain_global())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.classes_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are strings, but we also accept integers as we'll see in [part 3](./ebm-internals-multiclass.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_types)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we passed feature_types into the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier. Per scikit-learn convention, this was recorded unmodified in the ebm object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_types_in_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature_types passed into \\_\\_init\\_\\_ were actualized into the base feature types of ['ordinal', 'continuous']. Following the spirit of scikit-learn's [SLEP007 convention](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html), we recorded this in ebm.feature_types_in_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_names_in_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we did not specify feature names, some default names were created for the model. If we had passed feature_names to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, or if we had used a Pandas dataframe with column names, then ebm.feature_names_in_ would have contained those names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_features_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our model contains 3 additive terms. The first two terms are the main effect features, and the 3rd term is the pairwise interaction between the individual features. EBMs are not limited to only main and pair effects. We also support 3-way interactions, 4-way interactions, and higher order interactions as well. If there were any higher order interactions in the model, they would be listed in ebm.term_features_ as further tuples of indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_names_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_names_ is a convenience attribute that joins ebm.term_features_ and ebm.feature_names_in_ to create names for each of the additive terms.\n",
    "\n",
    "ebm.term_names_ is the result of:\n",
    "\n",
    "term_names = [\" & \".join(ebm.feature_names_in_[i] for i in grp) for grp in ebm.term_features_]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.bins_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.bins_ is a per-feature attribute. As described in [part 1](./ebm-internals-regression.ipynb), ebm.bins_ defines how to bin both categorical ('nominal' and 'ordinal') and 'continuous' features.\n",
    "\n",
    "For categorical features we use a dictionary that maps the category strings to bin indexes.\n",
    "\n",
    "As described in [part 1](./ebm-internals-regression.ipynb), continuous feature binning is defined with a list of cut points that partition the continuous range into regions. In this example, our dataset has 3 unique values for the continuous feature: 7.0, 8.0, and 9.0. Similarly to [part 1](./ebm-internals-regression.ipynb) the main effects in this example have 2 bin cuts that separate these into 3 regions. In this example, the bin cuts for main effects are again 7.5 and 8.5.\n",
    "\n",
    "EBMs support the ability to reduce the binning resolution when binning a feature for interactions. In the call to \\_\\_init\\_\\_ for the ExplainableBoostingClassifier, we specified max_interaction_bins=4, which limited the EBM to creating just 4 bins when binning for interactions. Two of those bins are reserved for 'missing' and 'unknown' values, which leaves the model with 2 bins for the remaining continuous feature values. We have 3 unique values in our dataset though, so the EBM is forced to decide which of these values to group together and choose a single cut point that separate them into the 2 regions. In this example, the EBM could have chosen any cut point between 7.0 and 9.0. It chose 8.5, which puts the 7.0 and 8.0 values in the lower bin and 9.0 in the upper bin.\n",
    "\n",
    "The binning definitions for main effect and interactions are stored in a list for each feature in the ebm.bins_ attribute. In this example, ebm.bins_[1] contains a list of arrays: [array([7.5, 8.5]), array([8.5])]. The first array of [7.5, 8.5] at ebm.bins_[1][0] is the binning resolution for main effects. The second array of [8.5] at ebm.bins_[1][1] is the binning resolution used when binning for interactions.\n",
    "\n",
    "The binning resolution does not stop at pairs. If an even lower resolution is desired for triples, then there would be a 3rd array of bin cuts included in the list. The last item in the list is the binning resolution used for all interaction orders higher than that position. If the EBM had contained just the binning resolution of [7.5, 8.5] in the list, then that resolution would be used for main effects, pairs, triples, and higher order interactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_scores_[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_scores_[0] is the lookup table for the first feature in this example. Since the first feature is an ordinal categorial, we use the dictionary {'low': 1, 'medium': 2, 'high': 3} to lookup which bin to use for each categorical string. If the feature value is NaN, then we use the score at index 0. If the feature value is \"low\", we use the score at index 1. If the feature value is \"medium\", we use the score at index 2. If the feature value is \"high\", we use the score at index 3. If the feature value is anything else, we use the score at index 4.\n",
    "\n",
    "In this example the 0th bin score is non-zero because we included a missing value in the dataset for this feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_scores_[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_scores_[1] is the lookup table for the second feature in this example. Since the second feature is a continuous feature, we use cut points for binning. The 0th bin index is again reserved for missing values, and the last bin index is again reserved for unknown values. In this example, the 0th bin score is non-zero because we included a missing value in the dataset for this feature.\n",
    "\n",
    "The ebm.bins_[1] attribute contains a list having 2 arrays of cut points. In this case we are binning a main effects feature, so we use the bins at index 0, which is ebm.bins_[1][0]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_scores_[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_scores_[2] is the lookup table for the pair composed of both features in this example. The features involved in the pair can be found at ebm.term_features_[2]. The pair lookup table is two dimensional, so indexing into it requires two indexes. The first index will be the bin index of the first feature, and the second index will be the bin index of the second feature. Example:\n",
    "\n",
    "pair_scores = ebm.term_scores_[2]\n",
    "\n",
    "local_score = pair_scores[(feature_0_index, feature_1_index)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Sample code</h2>\n",
    "\n",
    "Finally, here's some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like regression, multiclass, unknown values, or interactions beyond pairs.\n",
    "\n",
    "If you need a drop-in complete function that can work in all EBM scenarios, see the multiclass example in [part 3](./ebm-internals-multiclass.ipynb) which handles regression and binary classification in addition to multiclass and all the other nuances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_scores = []\n",
    "for sample in X:\n",
    "    # start from the intercept for each sample\n",
    "    score = float(ebm.intercept_)\n",
    "    \n",
    "    # We have 3 terms: two main effects, and 1 pair interaction\n",
    "    for term_idx, features in enumerate(ebm.term_features_):\n",
    "        # indexing into a tensor requires a multi-dimensional index\n",
    "        tensor_index = []\n",
    "\n",
    "        # main effects will have 1 feature, and pairs will have 2 features\n",
    "        for feature_idx in features:\n",
    "            feature_val = sample[feature_idx]\n",
    "            bin_idx = 0  # if missing value, use bin index 0\n",
    "\n",
    "            if feature_val is not None and feature_val is not np.nan:\n",
    "                # we bin differently for main effects and pairs,\n",
    "                # so determine which resolution is needed\n",
    "                if len(features) == 1 or len(ebm.bins_[feature_idx]) == 1:\n",
    "                    # this is a main effect or only one bin level\n",
    "                    # is available, so use the highest resolution bins\n",
    "                    bins = ebm.bins_[feature_idx][0]\n",
    "                elif len(features) == 2 or len(ebm.bins_[feature_idx]) == 2:\n",
    "                    # use the lower resolution bins\n",
    "                    bins = ebm.bins_[feature_idx][1]\n",
    "                else:\n",
    "                    raise Exception(\"Unsupported bin resolution\")\n",
    "\n",
    "                if isinstance(bins, dict):\n",
    "                    # categorical feature\n",
    "                    bin_idx = bins[feature_val]\n",
    "                else:\n",
    "                    # continuous feature\n",
    "                    # add 1 because the 0th bin is reserved for 'missing'\n",
    "                    bin_idx = np.digitize(feature_val, bins) + 1\n",
    "\n",
    "            tensor_index.append(bin_idx)\n",
    "        # local_score is also the local feature importance\n",
    "        local_score = ebm.term_scores_[term_idx][tuple(tensor_index)]\n",
    "        score += local_score\n",
    "    sample_scores.append(score)\n",
    "\n",
    "logits = np.array(sample_scores)\n",
    "\n",
    "# use the sigmoid function to convert the logits into probabilities\n",
    "probabilities = 1 / (1 + np.exp(-logits))\n",
    "\n",
    "print(\"probability of \" + ebm.classes_[1])\n",
    "print(ebm.predict_proba(X)[:, 1])\n",
    "print(probabilities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For regression, our default link function was the identity link function, so the scores were the actual predictions.\n",
    "\n",
    "For classification, the scores are logits and we need to apply the inverse link function to calculate the probabilities. For binary classification the inverse link function is the sigmoid function.\n",
    "\n",
    "Identically to regression in [part 1](./ebm-internals-regression.ipynb), the 'local_score' variable contains the values shown for the local explanations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "show(ebm.explain_local(X, y), 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
